Data Processing Method and Apparatus, Device, and Medium

ABSTRACT

A data processing method related to the field of artificial intelligence includes adding an architecture parameter to each feature interaction item in a first model, to obtain a second model, where the first model is a factorization machine (FM)-based model, and the architecture parameter represents importance of a corresponding feature interaction item; performing optimization on architecture parameters in the second model to obtain the optimized architecture parameters; and obtaining, based on the optimized architecture parameters and the first model or the second model, a third model through feature interaction item deletion.

CROSS-REFERENCE TO RELATED APPLICATIONS

This is a continuation of International Patent Application No. PCT/CN2021/077375 filed on Feb. 23, 2021, which claims priority to Chinese Patent Application No. 202010202053.7 filed on Mar. 20, 2020. The disclosures of the aforementioned applications are hereby incorporated by reference in their entireties.

TECHNICAL FIELD

This disclosure relates to the field of artificial intelligence, and in particular, to a data processing method and an apparatus.

BACKGROUND

With rapid development of Internet technologies, an information overload problem occurs. To resolve the information overload problem, a recommender system (RS) emerges. Click-through rate (CTR) prediction is an important step in the recommender system. Whether to recommend a commodity needs to be determined based on a predicted CTR. In addition to a single feature, a feature interaction also needs to be considered during CTR prediction. To represent the feature interaction, a factorization machine (FM) model is proposed. The FM model includes feature interaction items of all interactions of single features. In a conventional technology, a CTR prediction model is usually built based on an FM.

A quantity of feature interaction items in the FM model increases exponentially with an order of a feature interaction. Therefore, with an increasingly higher order, the feature interaction items become numerous. As a result, there is an extremely large computing workload in FM model training. To resolve this problem, feature interaction selection (FIS) is proposed. Manual FIS is time-consuming and labor-intensive. Therefore, automatic FIS (AutoFIS) is proposed in the industry.

In an existing automatic FIS solution, search space formed by all possible feature interaction subsets is searched for an optimal subset, to implement FIS. A search process consumes a large amount of energy and computing power.

SUMMARY

This disclosure provides a data processing method and an apparatus, to reduce a computing workload and computing power consumption of FIS.

According to a first aspect, a data processing method is provided. The method includes adding an architecture parameter to each feature interaction item in a first model, to obtain a second model, where the first model is an FM-based model, and the architecture parameter represents importance of a corresponding feature interaction item; performing optimization on architecture parameters in the second model, to obtain the optimized architecture parameters; and obtaining, based on the optimized architecture parameters and the first model or the second model, a third model through feature interaction item deletion.

The FM-based model represents a model built based on the FM principle, and includes, for example, any one of the following models: an FM model, a DeepFM model, an Inner Product-based Neural Network (IPNN) model, an Attentional FM (AFM) model, and a Neural FM (NFM) model.

The third model may be a model obtained through feature interaction item deletion based on the first model.

Alternatively, the third model may be a model obtained through feature interaction item deletion based on the second model.

A feature interaction item to be deleted or retained (or selected) may be determined in a plurality of manners.

Optionally, in an implementation, a feature interaction item corresponding to an architecture parameter in the optimized architecture parameters whose value is less than a threshold may be deleted.

The threshold represents a criterion for determining whether to retain a feature interaction item. For example, if a value of an optimized architecture parameter of a feature interaction item is less than the threshold, it indicates that the feature interaction item is to be deleted. If a value of an optimized architecture parameter of a feature interaction item reaches the threshold, it indicates that the feature interaction item is to be retained (or selected).

The threshold may be determined based on an actual application requirement. For example, a value of the threshold may be obtained through model training. A manner of obtaining the threshold is not limited in this disclosure.
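As an illustration of this threshold-based selection, the following is a minimal NumPy sketch; the array values, the threshold, and all variable names are hypothetical and not taken from this disclosure.

```python
import numpy as np

# Hypothetical optimized architecture parameters for the C(4, 2) = 6
# second-order feature interaction items of m = 4 features.
alpha_opt = np.array([0.91, 0.02, 0.47, 0.00, 0.33, 0.05])
threshold = 0.10  # illustrative value; the disclosure leaves it open

# An item is retained when its optimized architecture parameter reaches
# the threshold, and deleted when the parameter is less than it.
keep = alpha_opt >= threshold
retained_items = np.flatnonzero(keep)    # -> [0 2 4]
deleted_items = np.flatnonzero(~keep)    # -> [1 3 5]
```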

Optionally, in another implementation, if values of some architecture parameters change to zero after optimization is completed, feature interaction items corresponding to the optimized architecture parameters whose values are not zero may be directly used as retained feature interaction items, to obtain the third model.

Optionally, in still another implementation, if values of some architecture parameters change to zero after optimization is completed, a feature interaction item corresponding to an architecture parameter whose value is less than the threshold may be further deleted from feature interaction items corresponding to the optimized architecture parameters whose values are not zero, to obtain the third model.

In an existing automatic FIS solution, all possible feature interaction subsets are used as search space, and a best candidate subset is selected from n randomly selected candidate subsets by using a discrete algorithm as a selected feature interaction. Training needs to be performed once for evaluating each candidate subset, resulting in a large computing workload and high computing power consumption.

In this disclosure, the architecture parameters are introduced into the FM-based model, so that feature interaction item selection can be performed through optimization on the architecture parameters. In other words, in this disclosure, provided that optimization on the architecture parameters is performed once, feature interaction item selection can be performed, and training for a plurality of candidate subsets in a conventional technology is not required. Therefore, this can effectively reduce a computing workload of FIS to save computing power, and improve efficiency of FIS.

In addition, an existing automatic FIS solution cannot be applied to a deep learning model with a long training period, because of the large computing workload and high computing power consumption.

In this disclosure, FIS can be performed through an optimization process of the architecture parameters. Alternatively, feature interaction item selection can be completed through one end-to-end model training process, so that a period for feature interaction item selection (or search) may be equivalent to a period for one model training. Therefore, FIS can be applied to a deep learning model with a long training period.

In the FM model in the conventional technology, because all feature interactions need to be enumerated, it is difficult to extend to a higher order.

In this disclosure, the architecture parameters are introduced into the FM-based model, so that FIS can be performed through optimization on the architecture parameters. Therefore, in the solution of this disclosure, the feature interaction item in the FM-based model can be extended to a higher order.

Optimization may be performed on the architecture parameters in the second model by using a plurality of optimization algorithms (or optimizers).

With reference to the first aspect, in a possible implementation of the first aspect, optimization allows the optimized architecture parameters to be sparse.

In this disclosure, optimization on the architecture parameters allows the architecture parameters to be sparse, facilitating subsequent feature interaction item deletion.

Optionally, in an implementation in which optimization allows the optimized architecture parameters to be sparse, obtaining, based on the optimized architecture parameters and the first model or the second model, a third model through feature interaction item deletion includes obtaining, based on the first model or the second model, the third model by deleting a feature interaction item corresponding to an architecture parameter in the optimized architecture parameters whose value is less than a threshold.

In an implementation, in the first model, the third model is obtained by deleting a feature interaction item corresponding to an architecture parameter in the optimized architecture parameters whose value is less than a threshold.

In another implementation, in the second model, the third model is obtained by deleting a feature interaction item corresponding to an architecture parameter in the optimized architecture parameters whose value is less than a threshold.

It should be understood that the third model is obtained through feature interaction item deletion based on the second model, so that the third model has the optimized architecture parameters that represent importance of the feature interaction items. Subsequently, importance of the feature interaction items can be further learned through training of the third model.

With reference to the first aspect, in a possible implementation of the first aspect, optimization allows a value of an architecture parameter of at least one feature interaction item to be equal to zero after optimization is completed.

It is assumed that a feature interaction item corresponding to an architecture parameter whose value is zero after optimization is completed is considered as an unimportant feature interaction item. That optimization allows a value of an architecture parameter of at least one feature interaction item to be equal to zero after optimization is completed may be considered as allowing the value of the architecture parameter of the unimportant feature interaction item to be equal to zero after optimization is completed.

Optionally, optimization is performed on the architecture parameters in the second model using a generalized regularized dual averaging (gRDA) optimizer, where the gRDA optimizer allows the value of the architecture parameter of the at least one feature interaction item to tend to zero during an optimization process.

In embodiments of this disclosure, optimization on the architecture parameters allows some architecture parameters to tend to zero, which is equivalent to removing some unimportant feature interaction items in an architecture parameter optimization process. In other words, optimization on the architecture parameters implements architecture parameter optimization and feature interaction item selection. This can improve efficiency of FIS and reduce a computing workload and computing power consumption.

In addition, in the architecture parameter optimization process, removing some unimportant feature interaction items can prevent noise generated by these unimportant feature interaction items. In this case, a model gradually evolves into an ideal model in the architecture parameter optimization process. In addition, estimation of other parameters (for example, architecture parameters and model parameters of an unremoved feature interaction item) in the model can be more accurate.

Optionally, in an implementation in which optimization allows the optimized architecture parameters to be sparse and allows a value of an architecture parameter of at least one feature interaction item to be equal to zero after optimization is completed, obtaining, based on the optimized architecture parameters and the first model or the second model, a third model through feature interaction item deletion includes obtaining the third model by deleting a feature interaction item other than feature interaction items corresponding to the optimized architecture parameters.

Optionally, in the first model, the third model is obtained by deleting the feature interaction item other than the feature interaction items corresponding to the optimized architecture parameters. In other words, the third model is obtained through feature interaction item deletion based on the first model.

Optionally, the second model obtained through architecture parameter optimization is used as the third model. In other words, the third model is obtained through feature interaction item deletion based on the second model.

It should be understood that the third model is obtained through feature interaction item deletion based on the second model, so that the third model has the optimized architecture parameters that represent importance of the feature interaction items. Subsequently, importance of the feature interaction items can be further learned through training of the third model.

Optionally, in an implementation in which optimization allows the optimized architecture parameters to be sparse and allows a value of an architecture parameter of at least one feature interaction item to be equal to zero after optimization is completed, obtaining, based on the optimized architecture parameters and the first model or the second model, a third model through feature interaction item deletion includes obtaining the third model by deleting a feature interaction item other than feature interaction items corresponding to the optimized architecture parameters and deleting the feature interaction item corresponding to the architecture parameter in the optimized architecture parameters whose value is less than the threshold.

Optionally, in the first model, the third model is obtained by deleting the feature interaction item other than the feature interaction items corresponding to the optimized architecture parameters and deleting the feature interaction item corresponding to the architecture parameter in the optimized architecture parameters whose value is less than the threshold.

Optionally, in the second model obtained through architecture parameter optimization, the third model is obtained by deleting a feature interaction item corresponding to an architecture parameter in the optimized architecture parameters whose value is less than a threshold.

It should be understood that the third model is obtained through feature interaction item deletion based on the second model, so that the third model has the optimized architecture parameters that represent importance of the feature interaction items. Subsequently, importance of the feature interaction items can be further learned through training of the third model.

With reference to the first aspect, in a possible implementation of the first aspect, the method further includes performing optimization on model parameters in the second model, where optimization includes scalarization processing on the model parameters in the second model.

The model parameters indicate weight parameters other than the architecture parameters of the feature interaction item in the second model. In other words, the model parameters represent original parameters in the first model.

In an implementation, optimization includes performing batch normalization (BN) processing on the model parameters in the second model.

It should be understood that scalarization processing is performed on the model parameters of the feature interaction item, to decouple the model parameters from the architecture parameters of the feature interaction item. In this case, the architecture parameters can more accurately reflect importance of the feature interaction items, further improving optimization accuracy of the architecture parameters.
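The following is a minimal sketch of the BN idea, under the assumption (made here for illustration only) that BN is applied to the batch of values of each feature interaction item, so that the scale contributed by the model parameters is normalized away and the architecture parameter alone reflects item importance.

```python
import numpy as np

def batch_norm(h, eps=1e-5):
    # Normalize a batch of interaction-item values to zero mean and
    # unit variance; the scale contributed by the model parameters is
    # removed, decoupling them from the architecture parameter.
    return (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + eps)

# Hypothetical batch: values of one item <v_i, v_j> x_i x_j over 8 samples.
h_ij = np.array([[0.4], [1.2], [-0.3], [0.8], [0.1], [-0.9], [0.5], [0.2]])
h_norm = batch_norm(h_ij)  # the item enters the model as alpha_ij * h_norm
```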

With reference to the first aspect, in a possible implementation of the first aspect, performing optimization on the architecture parameters in the second model and performing optimization on the model parameters in the second model include performing simultaneous optimization on both the architecture parameters and the model parameters in the second model by using same training data, to obtain the optimized architecture parameters.

In other words, in each round of training in an optimization process, simultaneous optimization is performed on both the architecture parameters and the model parameters based on a same batch of training data.

Alternatively, the architecture parameters and the model parameters in the second model are considered as decision variables at a same level, and simultaneous optimization is performed on both the architecture parameters and the model parameters in the second model, to obtain the optimized architecture parameters.

In this disclosure, one-level optimization processing is performed on the architecture parameters and the model parameters in the second model, to implement optimization on the architecture parameters in the second model, so that simultaneous optimization can be performed on the architecture parameters and the model parameters. Therefore, time consumed in an optimization process of the architecture parameters in the second model can be reduced, to further help improve efficiency of feature interaction item selection.
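A minimal sketch of such one-level optimization follows, assuming a toy second model whose prediction is a weighted sum of precomputed interaction-item values z; the squared loss and all names are illustrative assumptions rather than requirements of this disclosure.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items = 6                      # number of feature interaction items
alpha = np.ones(n_items)         # architecture parameters
w = rng.normal(size=n_items)     # model parameters
lr = 0.05

def joint_step(z, y):
    """One-level update: alpha and w are decision variables at the same
    level and are updated simultaneously from the same mini-batch."""
    global alpha, w
    err = z @ (alpha * w) - y                      # prediction error
    grad_alpha = (err[:, None] * z * w).mean(axis=0)
    grad_w = (err[:, None] * z * alpha).mean(axis=0)
    alpha -= lr * grad_alpha                       # same data,
    w -= lr * grad_w                               # same step

joint_step(rng.normal(size=(32, n_items)), rng.normal(size=32))
```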

With reference to the first aspect, in a possible implementation of the first aspect, the method further includes training the third model to obtain a CTR prediction model or a conversion rate (CVR) prediction model.

According to a second aspect, a data processing method is provided. The method includes inputting data of a target object into a CTR prediction model or a CVR prediction model, to obtain a prediction result of the target object, and determining a recommendation status of the target object based on the prediction result of the target object.

The CTR prediction model or the CVR prediction model is obtained through the method in the first aspect.

Training of a third model includes the following step: train the third model by using a training sample of the target object, to obtain the CTR prediction model or the CVR prediction model.

Optionally, optimization on architecture parameters includes the following step: perform simultaneous optimization on both the architecture parameters and model parameters in a second model by using the same training data as that in the training sample of the target object, to obtain the optimized architecture parameters.

According to a third aspect, a data processing apparatus is provided. The apparatus includes the following units.

A first processing unit is configured to add an architecture parameter to each feature interaction item in a first model, to obtain a second model, where the first model is an FM-based model, and the architecture parameter represents importance of a corresponding feature interaction item.

A second processing unit is configured to perform optimization on architecture parameters in the second model, to obtain the optimized architecture parameters.

A third processing unit is configured to obtain, based on the optimized architecture parameters and the first model or the second model, a third model through feature interaction item deletion.

With reference to the third aspect, in a possible implementation of the third aspect, the second processing unit performs optimization on the architecture parameters, to allow the optimized architecture parameters to be sparse.

With reference to the third aspect, in a possible implementation of the third aspect, the third processing unit is configured to obtain, based on the first model or the second model, the third model by deleting a feature interaction item corresponding to an architecture parameter in the optimized architecture parameters whose value is less than a threshold.

With reference to the third aspect, in a possible implementation of the third aspect, the second processing unit performs optimization on the architecture parameters, to allow a value of an architecture parameter of at least one feature interaction item to be equal to zero after optimization is completed.

With reference to the third aspect, in a possible implementation of the third aspect, the second processing unit is configured to optimize the architecture parameters in the second model using a gRDA optimizer, where the gRDA optimizer allows the value of the architecture parameter of the at least one feature interaction item to tend to zero during an optimization process.

With reference to the third aspect, in a possible implementation of the third aspect, the second processing unit is further configured to perform optimization on model parameters in the second model, where optimization includes scalarization processing on the model parameters in the second model.

With reference to the third aspect, in a possible implementation of the third aspect, the second processing unit is configured to perform BN processing on the model parameters in the second model.

With reference to the third aspect, in a possible implementation of the third aspect, the second processing unit is configured to perform simultaneous optimization on both the architecture parameters and model parameters in a second model by using same training data, to obtain the optimized architecture parameters.

With reference to the third aspect, in a possible implementation of the third aspect, the apparatus further includes a training unit configured to train the third model, to obtain a CTR prediction model or a CVR prediction model.

According to a fourth aspect, a data processing apparatus is provided. The apparatus includes the following units.

A first processing unit is configured to input data of a target object into a CTR prediction model or a CVR prediction model, to obtain a prediction result of the target object.

A second processing unit is configured to determine a recommendation status of the target object based on the prediction result of the target object.

The CTR prediction model or the CVR prediction model is obtained through the method in the first aspect.

Training of a third model includes the following step: train the third model by using a training sample of the target object, to obtain the CTR prediction model or the CVR prediction model.

Optionally, optimization on architecture parameters includes the following step: perform simultaneous optimization on both the architecture parameters and model parameters in a second model by using the same training data as that in the training sample of the target object, to obtain the optimized architecture parameters.

According to a fifth aspect, a data processing apparatus is provided. The apparatus includes a memory configured to store a program, and a processor configured to execute the program stored in the memory, where when the program stored in the memory is being executed, the processor is configured to perform the method in the first aspect or the second aspect.

According to a sixth aspect, a computer-readable medium is provided. The computer-readable medium stores program code to be executed by a device, and the program code is used to perform the method in the first aspect or the second aspect.

According to a seventh aspect, a computer program product including instructions is provided. When the computer program product runs on a computer, the computer is enabled to perform the method in the first aspect or the second aspect.

According to an eighth aspect, a chip is provided. The chip includes a processor and a data interface. The processor reads, through the data interface, instructions stored in a memory, to perform the method in the first aspect or the second aspect.

Optionally, in an implementation, the chip may further include a memory, and the memory stores instructions; the processor is configured to execute the instructions stored in the memory, and when the instructions are executed, the processor is configured to perform the method in the first aspect or the second aspect.

According to a ninth aspect, an electronic device is provided. The electronic device includes the apparatus provided in the third aspect, the fourth aspect, the fifth aspect, or the sixth aspect.

It can be learned from the foregoing description that, in this disclosure, the architecture parameters are introduced into the FM-based model, so that feature interaction item selection can be performed through optimization on the architecture parameters. In other words, in this disclosure, feature interaction item selection can be performed through optimization on the architecture parameters, and training for a plurality of candidate subsets in the conventional technology is not required. Therefore, this can effectively reduce a computing workload of FIS to save computing power, and improve efficiency of FIS.

In addition, in the solution provided in this disclosure, the feature interaction item in the FM-based model can be extended to a higher order.

BRIEF DESCRIPTION OF DRAWINGS

FIG. 1 is a schematic diagram of an FM model architecture;

FIG. 2 is a schematic diagram of FM model training;

FIG. 3 is a schematic flowchart of a data processing method according to an embodiment of this disclosure;

FIG. 4 is a schematic diagram of an FM model architecture according to an embodiment of this disclosure;

FIG. 5 is another schematic flowchart of a data processing method according to an embodiment of this disclosure;

FIG. 6 is still another schematic flowchart of a data processing method according to an embodiment of this disclosure;

FIG. 7 is a schematic block diagram of a data processing apparatus according to an embodiment of this disclosure;

FIG. 8 is another schematic block diagram of a data processing apparatus according to an embodiment of this disclosure;

FIG. 9 is still another schematic block diagram of a data processing apparatus according to an embodiment of this disclosure; and

FIG. 10 is a schematic diagram of a hardware architecture of a chip according to an embodiment of this disclosure.

DESCRIPTION OF EMBODIMENTS

The following describes technical solutions of this disclosure with reference to accompanying drawings.

With rapid development of current technologies, there is an increasing amount of data. To solve an information overload problem, a recommender system (RS) is proposed. The recommender system sends historical behavior, interests, preferences, or demographic features of a user to a recommendation algorithm, and then uses the recommendation algorithm to generate a list of items that the user may be interested in.

In the recommender system, CTR prediction (or further including CVR prediction) is a very important step. Whether to recommend a commodity needs to be determined based on a predicted CTR. In addition to a single feature, a feature interaction also needs to be considered during CTR prediction. The feature interaction is very important for recommendation ranking. An FM can reflect the feature interaction. The FM may be referred to as an FM model.

Based on a maximum order of the feature interaction item, the FM model may be referred to as a *-order FM model. For example, an FM model whose feature interaction item has a maximum of a second order may be referred to as a second-order FM model, and an FM model whose feature interaction item has a maximum of a third order may be referred to as a third-order FM model.

An order of the feature interaction item indicates a specific quantity of features corresponding to the feature interaction item. For example, an interaction item of two features may be referred to as a second-order feature interaction item, and an interaction item of three features may be referred to as a third-order feature interaction item.
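For m single features, the quantity of feature interaction items of each order is a binomial coefficient, which makes the growth described above concrete:

$$\binom{m}{2} = \frac{m(m-1)}{2}, \qquad \binom{m}{3} = \frac{m(m-1)(m-2)}{6}$$

For example, for m = 10, a second-order FM model contains 45 second-order feature interaction items, and extending it to a third-order FM model adds another 120 third-order items.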

In an example, the second-order FM model is shown in the followingformula (1):

$$\hat{y}(x) := w_{0} + \sum_{i=1}^{m} w_{i}x_{i} + \sum_{i=1}^{m}\sum_{j=i+1}^{m} \left\langle v_{i},v_{j} \right\rangle x_{i}x_{j} \qquad (1)$$

$$\left\langle v_{i},v_{j} \right\rangle := \sum_{f=1}^{k} v_{i,f} \cdot v_{j,f}$$

x indicates a feature vector, x_(i) indicates an ith feature, and x_(j) indicates a jth feature. m indicates a quantity of features, and may also be referred to as a feature field. w₀ indicates a global offset, and w₀∈R. w_(i) indicates strength of the ith feature, and w∈R^(m). v_(i) indicates an auxiliary vector of the ith feature x_(i). v_(j) indicates an auxiliary vector of the jth feature x_(j). k indicates a dimension of the auxiliary vectors v_(i) and v_(j). The auxiliary vectors form a two-dimensional matrix v∈R^(m×k).

x_(i)x_(j) indicates a combination of the ith feature x_(i) and the jth feature x_(j).

⟨v_(i), v_(j)⟩ indicates an inner product of v_(i) and v_(j), and indicates interaction between the ith feature x_(i) and the jth feature x_(j). ⟨v_(i), v_(j)⟩ may also be understood as a weight parameter of a feature interaction item x_(i)x_(j), for example, ⟨v_(i), v_(j)⟩ may be denoted as w_(ij).

In this specification, ⟨v_(i), v_(j)⟩ is denoted as a weight parameter of a feature interaction item x_(i)x_(j).

Optionally, the formula (1) may also be expressed as the following formula (2):

$$l_{fm} = \left\langle w,x \right\rangle + \sum_{i=1}^{m}\sum_{j>i}^{m} \left\langle e_{i},e_{j} \right\rangle \qquad (2)$$

In the formula (2), ⟨e_(i), e_(j)⟩ indicates ⟨v_(i), v_(j)⟩x_(i)x_(j) in the formula (1), and ⟨w, x⟩ indicates w₀ + Σ_(i=1)^(m)w_(i)x_(i) in the formula (1).
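As a concrete reading of the formula (1), the following is a minimal NumPy sketch of the second-order FM model; the toy sizes and all variable names are illustrative assumptions.

```python
import numpy as np

def fm_forward(x, w0, w, V):
    """Second-order FM model of the formula (1).

    x: (m,) feature vector; w0: global offset; w: (m,) feature strengths;
    V: (m, k) matrix whose rows are the auxiliary vectors v_i.
    """
    m = x.shape[0]
    y = w0 + w @ x
    for i in range(m):
        for j in range(i + 1, m):
            y += (V[i] @ V[j]) * x[i] * x[j]   # <v_i, v_j> x_i x_j
    return y

rng = np.random.default_rng(0)                 # toy input: m = 4, k = 3
print(fm_forward(rng.normal(size=4), 0.1,
                 rng.normal(size=4), rng.normal(size=(4, 3))))
```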

In another example, the third-order FM model is shown in the following formula (3):

$$l_{fm}^{3rd} = \left\langle w,x \right\rangle + \sum_{i=1}^{m}\sum_{j>i}^{m} \left\langle e_{i},e_{j} \right\rangle + \sum_{i=1}^{m}\sum_{j>i}^{m}\sum_{t>j}^{m} \left\langle e_{i},e_{j},e_{t} \right\rangle \qquad (3)$$

The FM model includes feature interaction items of all interactions of single features. For example, a second-order FM model shown in the formula (1) or the formula (2) includes feature interaction items of all second-order feature interactions of single features. For another example, the third-order FM model shown in the formula (3) includes feature interaction items of all second-order feature interactions of single features and feature interaction items of all third-order feature interactions of single features.

For example, in the industry, an operation of obtaining an auxiliary vector v_(i) of a feature x_(i) is referred to as embedding, and an operation of building a feature interaction item based on the feature x_(i) and the auxiliary vector v_(i) thereof is referred to as interaction. FIG. 1 is a schematic diagram of an FM model architecture. As shown in FIG. 1, an FM model may be considered as a neural network model, and includes an input layer, an embedding layer, an interaction layer, and an output layer. The input layer is used to generate a feature. A field 1, a field 2, . . . , and a field m indicate the m feature fields. The embedding layer is used to generate an auxiliary vector of the feature. The interaction layer is used to generate a feature interaction item based on the feature and the auxiliary vector of the feature. The output layer is used to output a result of the FM model.

In the conventional technology, CTR prediction or CVR prediction is usually based on an FM.

In the current technology, an FM-based model includes an FM model, a DeepFM model, an IPNN model, an AFM model, an NFM model, and the like.

As an example instead of a limitation, a procedure of building an FM model is shown in FIG. 2.

S210: Enumerate and enter feature interaction items into the FM model.

For example, the FM model is built by using the formula (1) or the formula (3).

S220: Train the FM model until convergence, to obtain an FM model that can be put into use.

After FM model training is completed, online inference may be performed by using the trained FM model, as shown in step S230 in FIG. 2.

As described above, the FM model includes the feature interaction items of all interactions of single features. Therefore, FM model training has an extremely large computing workload and consumes a lot of time.

In addition, it can be learned from the formula (1) and the formula (3) that a quantity of feature interaction items in the FM model increases sharply with increases in the quantity of features and an order of feature interaction.

For example, in the formula (1), as the quantity m of features increases, the quantity of feature interaction items increases exponentially. For another example, as the order of feature interaction increases in a switch from a second-order FM model to a third-order FM model, the quantity of feature interaction items in the FM model increases greatly.

Therefore, increases in the quantity of features and the order of feature interaction result in a huge burden to an inference delay and a computing workload of the FM model. Therefore, a maximum quantity of features and the order of feature interaction that can be accommodated by the FM model are limited. For example, it is difficult to extend the FM model in the current technology to a higher order.

To resolve this problem, FIS is proposed.

In some conventional technologies, FIS is performed in a manual selection manner. Selecting good feature interactions may take many years of exploration by engineers. This manual selection manner consumes a large amount of manpower, and may miss an important feature interaction item.

To address a disadvantage of manual selection, an AutoFIS solution is proposed in the industry. Compared with manual selection, valuable feature interactions can be selected through automatic FIS in a short period of time.

In the current technology, an automatic FIS solution is proposed. In the solution, all possible feature interaction subsets are used as search space, and a best candidate subset is selected from n randomly selected candidate subsets by using a discrete algorithm as a selected feature interaction. Training needs to be performed once for evaluating each candidate subset, resulting in a large computing workload and high computing power consumption. In addition, when each candidate subset is evaluated, entire model training can improve evaluation accuracy but may cause huge search energy consumption (search cost), while mini-batch training that is used for approximation may result in inaccurate evaluation. In addition, in this solution, as the order of the feature interaction increases, the search space increases exponentially, which increases energy consumption in a search process.

Therefore, the existing automatic FIS solution has a large computing workload, high energy consumption in a search process, and high computing power consumption.

For the foregoing problem, this disclosure provides an automatic FIS solution. Compared with the conventional technology, this solution can reduce computing power consumption of automatic FIS, and improve efficiency of automatic FIS.

FIG. 3 is a schematic flowchart of a data processing method 300 according to an embodiment of this disclosure. The method 300 includes the following steps: S310, S320, and S330.

S310: Add an architecture parameter to each feature interaction item in a first model, to obtain a second model.

The first model is a model based on an FM. In other words, the first model includes feature interaction items of all interactions of single features, or the first model enumerates feature interaction items of all interactions.

For example, the first model may be any one of the following FM-based models: an FM model, a DeepFM model, an IPNN model, an AFM model, and an NFM model.

As an example, the first model is a second-order FM model shown in the formula (1) or the formula (2), or the first model is a third-order FM model shown in the formula (3).

In this disclosure, feature interaction item selection is performed, and the first model may be considered as a model on which a feature interaction item is to be deleted.

Adding an architecture parameter to each feature interaction item in the first model means adding a coefficient to each feature interaction item in the first model. In this disclosure, the coefficient is referred to as an architecture parameter. The architecture parameter represents importance of a corresponding feature interaction item. A model obtained by adding the architecture parameter to each feature interaction item in the first model is denoted as a second model.

In an example, assuming that the first model is a second-order FM model shown in the formula (1), the second model is shown in the following formula (4):

$$\hat{y}(x) := w_{0} + \sum_{i=1}^{m} w_{i}x_{i} + \sum_{i=1}^{m}\sum_{j=i+1}^{m} \alpha_{(i,j)} \left\langle v_{i},v_{j} \right\rangle x_{i}x_{j} \qquad (4)$$

$$\left\langle v_{i},v_{j} \right\rangle := \sum_{f=1}^{k} v_{i,f} \cdot v_{j,f}$$

x indicates a feature vector, x_(i) indicates an ith feature, and x_(j) indicates a jth feature. m indicates a feature dimension, and may also be referred to as a feature field. w₀ indicates a global offset, and w₀∈R. w_(i) indicates strength of the ith feature, and w∈R^(m). v_(i) indicates an auxiliary vector of the ith feature x_(i). v_(j) indicates an auxiliary vector of the jth feature x_(j). k indicates a dimension of the auxiliary vectors v_(i) and v_(j). The auxiliary vectors form a two-dimensional matrix v∈R^(m×k).

x_(i)x_(j) indicates a combination of the ith feature x_(i) and the jth feature x_(j).

⟨v_(i), v_(j)⟩ indicates a weight parameter of a feature interaction item x_(i)x_(j), and α_((i,j)) indicates an architecture parameter of the feature interaction item x_(i)x_(j).

⟨v_(i), v_(j)⟩ indicates an inner product of v_(i) and v_(j), and indicates interaction between the ith feature x_(i) and the jth feature x_(j). ⟨v_(i), v_(j)⟩ may also be understood as a weight parameter of a feature interaction item, for example, ⟨v_(i), v_(j)⟩ may be denoted as w_(ij).

Assuming that the first model is expressed as the second-order FM model shown in the formula (2), the second model may be expressed as the following formula (5):

$$l_{autoFIS} = \left\langle w,x \right\rangle + \sum_{i=1}^{m}\sum_{j>i}^{m} \alpha_{(i,j)} \left\langle e_{i},e_{j} \right\rangle \qquad (5)$$

α_((i,j)) indicates an architecture parameter of a feature interaction item.

In another example, if the first model is a third-order FM model shown in the formula (3), the second model is shown in the following formula (6):

$$l_{autoFIS} = \left\langle w,x \right\rangle + \sum_{i=1}^{m}\sum_{j>i}^{m} \alpha_{(i,j)} \left\langle e_{i},e_{j} \right\rangle + \sum_{i=1}^{m}\sum_{j>i}^{m}\sum_{t>j}^{m} \alpha_{(i,j,t)} \left\langle e_{i},e_{j},e_{t} \right\rangle \qquad (6)$$

α_((i,j)) and α_((i,j,t)) indicate architecture parameters of feature interaction items.

For ease of understanding and description, the following is agreed in this specification. An original weight parameter (for example, ⟨v_(i), v_(j)⟩ in the formula (4)) of the feature interaction item in the first model is referred to as a model parameter.

In other words, in the second model, each feature interaction item has two types of coefficient parameters: a model parameter and an architecture parameter.
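The following extends the earlier FM sketch to the second model of the formula (4): one architecture parameter per feature interaction item, initialized (as an illustrative assumption) to 1 so that the second model initially coincides with the first model.

```python
import numpy as np

def second_model_forward(x, w0, w, V, alpha):
    """Second model of the formula (4): the FM model of the formula (1)
    with an architecture parameter alpha[(i, j)] before each item."""
    m = x.shape[0]
    y = w0 + w @ x
    for i in range(m):
        for j in range(i + 1, m):
            # alpha_(i,j) scales the item and represents its importance;
            # (V[i] @ V[j]) * x[i] * x[j] is the model parameter times
            # the feature combination, i.e., <v_i, v_j> x_i x_j.
            y += alpha[(i, j)] * (V[i] @ V[j]) * x[i] * x[j]
    return y

m = 4
alpha = {(i, j): 1.0 for i in range(m) for j in range(i + 1, m)}
```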

FIG. 4 is a schematic diagram of feature interaction item selection according to an embodiment of this disclosure. An embedding layer and an interaction layer in FIG. 4 have the same meanings as those of the embedding layer and the interaction layer in FIG. 1. As shown in FIG. 4, in this embodiment of this disclosure, architecture parameters α_((i,j)) (α_((1,2)), α_((1,m)), and α_((m-1,m)) shown in FIG. 4) are added to feature interaction items at the interaction layer. The interaction layer in FIG. 4 may be considered as a first model, and the interaction layer whose feature interaction items have the architecture parameters α_((i,j)) added may be considered as a second model.

S320: Perform optimization on architecture parameters in the second model, to obtain the optimized architecture parameters.

For example, optimization is performed on the architecture parameters in the second model by using training data, to obtain the optimized architecture parameters.

For example, the optimized architecture parameters may be considered as optimal values α* of the architecture parameters in the second model.

In embodiments of this disclosure, the architecture parameter represents importance of a corresponding feature interaction item. Therefore, optimization on the architecture parameter is equivalent to learning importance of each feature interaction item or a contribution degree of each feature interaction item to model prediction. In other words, the optimized architecture parameter represents the learned importance of the feature interaction item.

In other words, in embodiments of this disclosure, contribution (or importance) of each feature interaction item may be learned by using the architecture parameters in an end-to-end manner.

S330: Obtain, based on the optimized architecture parameters and the first model or the second model, a third model through feature interaction item deletion.

The third model may be a model obtained through feature interaction item deletion based on the first model.

Alternatively, the third model may be a model obtained through feature interaction item deletion based on the second model.

A feature interaction item to be deleted or retained (or selected) may be determined in a plurality of manners.

Optionally, in an implementation, a feature interaction item corresponding to an architecture parameter in the optimized architecture parameters whose value is less than a threshold may be deleted.

The threshold represents a criterion for determining whether to retain a feature interaction item. For example, if a value of an optimized architecture parameter of a feature interaction item is less than the threshold, it indicates that the feature interaction item is to be deleted. If a value of an optimized architecture parameter of a feature interaction item reaches the threshold, it indicates that the feature interaction item is to be retained (or selected).

The threshold may be determined based on an actual application requirement. For example, a value of the threshold may be obtained through model training. A manner of obtaining the threshold is not limited in this disclosure.

Still refer to FIG. 4. Assuming that the optimized architecture parameter α_((1,2)) is less than the threshold, a feature interaction item corresponding to the architecture parameter α_((1,2)) may be deleted. Assuming that the optimized architecture parameter α_((1,m)) reaches the threshold, a feature interaction item corresponding to the architecture parameter α_((1,m)) may be retained. A next-layer model is obtained by deleting the feature interaction item based on the optimized architecture parameter, as shown in FIG. 4. The third model in the embodiment in FIG. 3 is, for example, the next-layer model shown in FIG. 4.

As an example, instead of a limitation, as shown in FIG. 4, an operation of determining, based on the optimized architecture parameter, whether to delete a corresponding feature interaction item may be denoted as a selection gate.

It should be noted that FIG. 4 is merely an example rather than a limitation.

Optionally, in another implementation, if values of some architecture parameters change to zero after optimization is completed, feature interaction items corresponding to the optimized architecture parameters whose values are not zero may be directly used as retained feature interaction items, to obtain the third model.

Optionally, in still another implementation, if values of some architecture parameters change to zero after optimization is completed, a feature interaction item corresponding to an architecture parameter whose value is less than the threshold may be further deleted from feature interaction items corresponding to the optimized architecture parameters whose values are not zero, to obtain the third model.

In this specification, a "model obtained through feature interaction item deletion" can be replaced with a "model obtained through feature interaction item selection".

As described above, in an existing automatic FIS solution, all possible feature interaction subsets are used as search space, and a best candidate subset is selected from n randomly selected candidate subsets by using a discrete algorithm as a selected feature interaction. Training needs to be performed once for evaluating each candidate subset, resulting in a large computing workload and high computing power consumption.

In embodiments of this disclosure, the architecture parameters are introduced into the FM-based model, so that feature interaction item selection can be performed through optimization on the architecture parameters. In other words, in this disclosure, provided that optimization on the architecture parameters is performed once, feature interaction item selection can be performed, and training for a plurality of candidate subsets in a conventional technology is not required. Therefore, this can effectively reduce a computing workload of FIS to save computing power, and improve efficiency of FIS.

In addition, in an existing automatic FIS solution, FIS is performed by searching for a candidate subset in search space. It may be understood that, in the conventional technology, FIS is resolved as a discrete issue; in other words, a discrete feature interaction candidate set is searched for.

In this embodiment of this disclosure, FIS is performed through optimization on the architecture parameters that are introduced into the FM-based model. It may be understood that, in this embodiment of this disclosure, the existing problem of searching for the discrete feature interaction candidate set is made continuous; in other words, FIS is resolved as a continuous issue. For example, the automatic FIS solution provided in this disclosure may be expressed as a feature interaction search solution based on continuous search space. In other words, in this embodiment of this disclosure, an operation of introducing the architecture parameters into the FM-based model may be considered as continuous modeling for automatic feature interaction item selection.

In addition, an existing automatic FIS solution cannot be applied to a deep learning model with a long training period, because of the large computing workload and high computing power consumption.

In embodiments of this disclosure, FIS can be performed through an optimization process of the architecture parameters. Alternatively, feature interaction item selection can be completed through one end-to-end model training process, so that a period for feature interaction item selection (or search) may be equivalent to a period for one model training. Therefore, FIS can be applied to a deep learning model with a long training period.

As described above, in the FM model in the conventional technology, because all feature interactions need to be enumerated, it is difficult to extend to a higher order.

In embodiments of this disclosure, the architecture parameters are introduced into the FM-based model, so that FIS can be performed through optimization on the architecture parameters. Therefore, in the solution provided in embodiments of this disclosure, the feature interaction item in the FM-based model can be extended to a higher order.

For example, an FM model built by using the solution provided in embodiments of this disclosure may be extended to a third order or a higher order.

For another example, the DeepFM model built by using the solution provided in this embodiment of this disclosure may be extended to a third order or a higher order.

In embodiments of this disclosure, the architecture parameters are introduced into the conventional FM-based model, so that FIS can be performed through optimization on the architecture parameters. In other words, in embodiments of this disclosure, the FM-based model that includes the architecture parameters is built, and FIS can be performed by performing optimization on the architecture parameters. A method for building the FM-based model that includes the architecture parameters is adding the architecture parameter before each feature interaction item in the conventional FM-based model.

As shown in FIG. 3, the method 300 may include step S340.

S340: Train the third model.

Step S340 may also be understood as performing model training again. It may be understood that the feature interaction item is deleted by using step S310, step S320, and step S330. In step S340, the model obtained through feature interaction item deletion is retrained.

In step S340, the third model may be directly trained, or the third model may be trained after an L1 regular term and/or an L2 regular term are/is added to the third model.

For example, an objective of training the third model may be determined based on an application requirement.

For example, assuming that a CTR prediction model is to be obtained, the third model is trained by using the CTR prediction model as the training objective, to obtain the CTR prediction model.

For another example, assuming that a CVR prediction model is to be obtained, the third model is trained by using the CVR prediction model as the training objective, to obtain the CVR prediction model.
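As one possible reading of step S340, the sketch below states a retraining objective for CTR prediction with optional L1/L2 regular terms; the log-loss choice and all names are assumptions for illustration, not requirements of this disclosure.

```python
import numpy as np

def retrain_loss(y, y_hat, params, l1=0.0, l2=0.0):
    """Retraining objective for the third model: log loss for CTR
    prediction plus optional L1 and/or L2 regular terms."""
    eps = 1e-12
    log_loss = -np.mean(y * np.log(y_hat + eps)
                        + (1 - y) * np.log(1.0 - y_hat + eps))
    return log_loss + l1 * np.sum(np.abs(params)) + l2 * np.sum(params ** 2)
```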

In step S320, for example, optimization may be performed on the architecture parameters in the second model by using a plurality of optimization algorithms (or optimizers).

A first optimization algorithm:

Optionally, in step S320, optimization is performed on the architecture parameters, to allow the optimized architecture parameters to be sparse.

For example, in step S320, optimization is performed on the architecture parameters in the second model by using least absolute shrinkage and selection operator (Lasso) regularization.

For example, assuming that the second model is expressed as the formula (5), in step S320, the architecture parameters in the second model are optimized by using the following formula (7):

$$L_{search} = L_{\alpha,w}\left( y,\hat{y}_{M} \right) + \lambda \sum_{i,j>i} \left| \alpha_{(i,j)} \right| \qquad (7)$$

L_(α,w)(y, ŷ_(M)) indicates a loss function. y indicates a model observed value. ŷ_(M) indicates a model predicted value. λ indicates a constant, and its value may be assigned based on a specific requirement.

It should be understood that the formula (7) indicates a constraint condition for architecture parameter optimization.

The optimized architecture parameters are sparse, facilitating subsequent feature interaction item deletion.
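A minimal sketch of the formula (7), together with the proximal (soft-threshold) step that such Lasso regularization induces, follows; the penalty weight and all names are illustrative assumptions.

```python
import numpy as np

def search_loss(pred_loss, alpha, lam=1e-3):
    # Formula (7): prediction loss plus the Lasso penalty on the
    # architecture parameters, which pushes them toward sparsity.
    return pred_loss + lam * np.sum(np.abs(alpha))

def soft_threshold(z, t):
    # Proximal step of the L1 term: entries within [-t, t] become
    # exactly zero, and the remaining entries shrink toward zero.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)
```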

Optionally, in an embodiment, in step S320, optimization on the architecture parameters allows the optimized architecture parameters to be sparse. In this case, in step S330, the third model is obtained, based on the first model or the second model, by deleting a feature interaction item corresponding to an architecture parameter in the optimized architecture parameters whose value is less than a threshold.

The threshold represents a criterion for determining whether to retain a feature interaction item. For example, if a value of an optimized architecture parameter of a feature interaction item is less than the threshold, it indicates that the feature interaction item is to be deleted. If a value of an optimized architecture parameter of a feature interaction item reaches the threshold, it indicates that the feature interaction item is to be retained (or selected).

The threshold may be determined based on an actual application requirement. For example, a value of the threshold may be obtained through model training. A manner of obtaining the threshold is not limited in this disclosure.

In embodiments of this disclosure, optimization on the architecture parameters allows the architecture parameters to be sparse, facilitating feature interaction item selection.

It should be understood that the architecture parameters in the second model represent importance or a contribution degree of a corresponding feature interaction. If a value of an optimized architecture parameter is less than the threshold, for example, close to zero, it indicates that a feature interaction item corresponding to the architecture parameter is not important or has a very low contribution degree. Deleting (or referred to as removing or cutting) such a feature interaction item can remove noise introduced by the feature interaction item, reduce energy consumption, and improve an inference speed of a model.

Therefore, deleting the feature interaction item corresponding to the architecture parameter in the optimized architecture parameters whose value is less than the threshold is an appropriate FIS operation.

A second optimization algorithm:

Optionally, in step S320, optimization is performed on the architecture parameters, so that the optimized architecture parameters are sparse and a value of an architecture parameter of at least one feature interaction item is equal to zero after optimization is completed.

It is assumed that the feature interaction item corresponding to the architecture parameter whose value is zero after optimization is completed is considered as an unimportant feature interaction item. Optimization on the architecture parameters in step S320 may be considered as allowing the value of the architecture parameter of the unimportant feature interaction item to be equal to zero after optimization is completed.

In other words, optimization on the architecture parameters allows the value of the architecture parameter of the at least one feature interaction item to tend to zero during an optimization process.

For example, in step S320, the architecture parameters in the second model are optimized using a gRDA optimizer. The gRDA optimizer allows the architecture parameters to be sparse, and allows the value of the architecture parameter of the at least one feature interaction item to gradually tend to zero during an optimization process.

For example, in step S320, the architecture parameters in the second model are optimized by using the following formula (8):

$$\alpha_{t+1} = \underset{\alpha}{\arg\min}\left\{ \alpha^{T}\left( -\alpha_{0} + \gamma\sum_{i=0}^{t}\nabla L\left( \alpha_{i};y_{i+1} \right) \right) + g\left( t,\gamma \right)\left\| \alpha \right\|_{1} + \frac{1}{2}\left\| \alpha \right\|_{2}^{2} \right\} \qquad (8)$$

γ indicates a learning rate. y_(i+1) indicates a model observation value. g(t,γ)=cγ^(1/2)(tγ)^(μ). c and μ represent adjustable hyperparameters. An objective of adjusting c and μ is to find a balance between model accuracy and sparseness of an architecture parameter α.

It should be understood that the formula (8) indicates a constraint condition for architecture parameter optimization.

It should be further understood that, in this embodiment, in step S320, the second model obtained through architecture parameter optimization is a model obtained through feature interaction item selection.
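The minimization in the formula (8) has a closed-form solution, a soft-threshold of the accumulated gradient trajectory; the following sketch assumes illustrative values of c and μ, and a precomputed gradient sum.

```python
import numpy as np

def grda_step(alpha0, grad_sum, t, gamma, c=0.005, mu=0.51):
    """Closed form of the formula (8): alpha_{t+1} is the soft-threshold
    of z = alpha_0 - gamma * grad_sum, with threshold g(t, gamma).

    grad_sum: accumulated gradients sum over i = 0..t of
              the gradient of L(alpha_i; y_{i+1}).
    c, mu: tunable hyperparameters (values here are illustrative only).
    """
    z = alpha0 - gamma * grad_sum
    g = c * np.sqrt(gamma) * (t * gamma) ** mu
    # Entries whose magnitude stays below g become exactly zero, which
    # removes unimportant items during the optimization process itself.
    return np.sign(z) * np.maximum(np.abs(z) - g, 0.0)
```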

In this disclosure, optimization on the architecture parameters allows some architecture parameters to tend to zero, which is equivalent to removing some unimportant feature interaction items in an architecture parameter optimization process. In other words, optimization on the architecture parameters implements architecture parameter optimization and feature interaction item selection. This can improve efficiency of FIS and reduce a computing workload and computing power consumption.

In addition, in the architecture parameter optimization process, removing some unimportant feature interaction items can prevent noise generated by these unimportant feature interaction items. In this case, a model gradually evolves into an ideal model in the architecture parameter optimization process. In addition, estimation of other parameters (for example, architecture parameters and model parameters of an unremoved feature interaction item) in the model can be more accurate.

Optionally, in an embodiment, in step S320, optimization is performed on the architecture parameters, so that the optimized architecture parameters are sparse and a value of an architecture parameter of at least one feature interaction item is equal to zero after optimization is completed. In this case, in step S330, the third model may be obtained in the following plurality of manners.

Manner (1):

In step S330, feature interaction items corresponding to the optimized architecture parameters may be directly used as selected feature interaction items, and the third model is obtained based on the selected feature interaction items.

For example, in the first model, the feature interaction items corresponding to the optimized architecture parameters are used as the selected feature interaction items, and remaining feature interaction items are deleted, to obtain the third model.

For another example, a model obtained through architecture parameter optimization on the second model is directly used as the third model.

Manner (2):

In step S330, each feature interaction item whose optimized architecture parameter has a value less than a threshold is further deleted from the feature interaction items corresponding to the optimized architecture parameters, to obtain the third model.

The threshold may be determined based on an actual application requirement. For example, a value of the threshold may be obtained through model training. A manner of obtaining the threshold is not limited in this disclosure.

For example, in the first model, the third model is obtained by deleting the feature interaction items other than those corresponding to the optimized architecture parameters and further deleting each feature interaction item whose optimized architecture parameter has a value less than the threshold.

For another example, in the second model obtained through architecture parameter optimization, the third model is obtained by deleting each feature interaction item whose optimized architecture parameter has a value less than the threshold.

In embodiments of this disclosure, optimization on the architecture parameters allows some architecture parameters to tend to zero, which is equivalent to removing some unimportant feature interaction items in an architecture parameter optimization process. In other words, optimization on the architecture parameters implements both architecture parameter optimization and feature interaction item selection. This can improve efficiency of FIS and reduce a computing workload and computing power consumption.

It can be learned from the foregoing description of step S320 that, in step S330, an implementation of obtaining the third model through feature interaction item selection may be determined based on an optimization manner of the architecture parameters in step S320. The following describes implementations of obtaining the third model in two cases.

In a first case, in step S320, optimization is performed on the architecture parameters, to allow the optimized architecture parameters to be sparse.

In step S330, the third model is obtained by deleting each feature interaction item whose optimized architecture parameter has a value less than a threshold. For the threshold, refer to the foregoing description. Details are not described herein again.

As an example instead of a limitation, the optimized architecture parameters obtained through architecture parameter optimization (namely, optimization convergence) are denoted as optimal values α* of the architecture parameters. Based on the optimal values α*, specific feature interaction items to be retained or deleted are determined. For example, if an optimal value α*_((i,j)) of an architecture parameter of a feature interaction item reaches the threshold, the feature interaction item should be retained; if the optimal value α*_((i,j)) is less than the threshold, the feature interaction item should be deleted.

For example, in the second model, for each feature interaction item, a selection gate ψ_((i,j)) indicating whether the feature interaction item is retained in a model is set. The second model may be expressed as the following formula (9):

$\begin{matrix}{{\hat{y}(x)} := {w_{0} + {\sum\limits_{i = 1}^{m}{w_{i}x_{i}}} + {\sum\limits_{i = 1}^{m}{\sum\limits_{j = {i + 1}}^{m}{\alpha_{({i,j})}\psi_{({i,j})}\left\langle {v_{i},v_{j}} \right\rangle x_{i}x_{j}}}}.}} & (9)\end{matrix}$

A value of the switch item ψ_((i,j)) may be represented by using the following formula (10):

$\begin{matrix}{\psi_{({i,j})} = \left\{ {\begin{matrix}1 & {{\left| \alpha_{({i,j})}^{*} \right|} \geq {thr}} \\ 0 & {{\left| \alpha_{({i,j})}^{*} \right|} < {thr}}\end{matrix}} \right..} & (10)\end{matrix}$

thr indicates a threshold.

A feature interaction item whose switch item ψ_((i,j)) is 0 is deleted from the second model, to obtain the third model through feature interaction item selection.

In this embodiment, setting of the switch item ψ_((i,j)) may be considered as a criterion for determining whether to retain a feature interaction item.
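The following sketch shows how formulas (9) and (10) might translate into code; the dictionary-based representation of interaction pairs and the helper names are assumptions made for illustration, not the disclosed implementation.

```python
def build_switch_items(alpha_star, thr):
    """Formula (10): psi_(i,j) = 1 if |alpha*_(i,j)| >= thr, else 0.
    alpha_star maps each interaction pair (i, j) to its optimal value."""
    return {pair: int(abs(a) >= thr) for pair, a in alpha_star.items()}

def select_interactions(interactions, psi):
    """Delete every feature interaction item whose switch item is 0;
    the survivors form the interaction set of the third model."""
    return [pair for pair in interactions if psi.get(pair, 0) == 1]
```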

Alternatively, the third model may be a model obtained through feature interaction item deletion based on the first model.

For example, the feature interaction item whose switch item ψ_((i,j)) is 0 is deleted from the first model, to obtain the third model through feature interaction item selection.

Alternatively, the third model may be a model obtained through feature interaction item deletion based on the second model.

For example, the feature interaction item whose switch item ψ_((i,j)) is 0 is deleted from the second model, to obtain the third model through feature interaction item selection.

It should be understood that, in this embodiment, the third model has optimized architecture parameters that represent importance of feature interaction items. Subsequently, importance of the feature interaction items can be further learned through training of the third model.

In a second case, in step S320, optimization is performed on the architecture parameters, so that the optimized architecture parameters are sparse and a value of an architecture parameter of at least one feature interaction item is equal to zero after optimization is completed.

Optionally, in step S330, the third model is obtained by deleting the feature interaction items other than the feature interaction items corresponding to the optimized architecture parameters.

In an example, in step S330, the third model is obtained by deleting, in the first model, the feature interaction items other than the feature interaction items corresponding to the optimized architecture parameters. In other words, the third model is obtained through feature interaction item deletion based on the first model.

In another example, in step S330, the second model obtained through architecture parameter optimization is used as the third model. In other words, the third model is obtained through feature interaction item deletion based on the second model.

Optionally, in step S330, the third model is obtained by deleting the feature interaction items other than the feature interaction items corresponding to the optimized architecture parameters and further deleting each feature interaction item whose optimized architecture parameter has a value less than the threshold.

In an example, in step S330, the third model is obtained by deleting, in the first model, the feature interaction items other than the feature interaction items corresponding to the optimized architecture parameters and further deleting each feature interaction item whose optimized architecture parameter has a value less than the threshold. In other words, the third model is obtained through feature interaction item deletion based on the first model.

In another example, in step S330, in the second model obtained through architecture parameter optimization, the third model is obtained by deleting each feature interaction item whose optimized architecture parameter has a value less than a threshold. In other words, the third model is obtained through feature interaction item deletion based on the second model.

It should be understood that, in an embodiment in which the third model is obtained through feature interaction item deletion based on the second model, the third model has the optimized architecture parameters that represent importance of the feature interaction items. Subsequently, importance of the feature interaction items can be further learned through training of the third model.

It may be understood from the formula (4), the formula (5), or the formula (6) that the second model includes two types of parameters: architecture parameters and model parameters. The model parameters are the weight parameters, other than the architecture parameters, of the feature interaction items in the second model. For example, in the second model expressed in the formula (4), α_((i,j)) indicates the architecture parameters of the feature interaction items, and v_(i) and v_(j) indicate the model parameters of the feature interaction items. For example, in the second model expressed in the formula (5), α_((i,j)) indicates the architecture parameters of the feature interaction items, and e_(i) and e_(j) may indicate the model parameters of the feature interaction items.

It may be understood that an architecture parameter optimization process involves architecture parameter training and model parameter training. In other words, optimization on the architecture parameters in the second model in step S320 is accompanied by optimization on the model parameters in the second model.

For example, in the embodiment shown in FIG. 3, the method 300 further includes performing optimization on the model parameters in the second model, where optimization includes scalarization processing on the model parameters.

In each round of training in the model parameter optimization process, scalarization processing is performed on the model parameters in the second model.

For example, scalarization processing is performed on the model parameters in the second model by performing batch normalization (BN) on the model parameters in the second model.

For example, in an example of the second model expressed in the formula (5), scalarization processing is performed on the model parameters in the second model by using the following formula (11):

$\begin{matrix}{\left\langle {e_{i},e_{j}} \right\rangle_{BN} = {\frac{\left\langle {e_{i},e_{j}} \right\rangle_{B} - {\mu_{B}\left( \left\langle {e_{i},e_{j}} \right\rangle_{B} \right)}}{\sqrt{{\sigma_{B}^{2}\left( \left\langle {e_{i},e_{j}} \right\rangle_{B} \right)} + \theta}}.}} & (11)\end{matrix}$

⟨e_(i), e_(j)⟩_(BN) indicates BN of ⟨e_(i), e_(j)⟩. ⟨e_(i), e_(j)⟩_(B) indicates mini-batch data of ⟨e_(i), e_(j)⟩. μ_(B)(⟨e_(i), e_(j)⟩_(B)) indicates an average value of the mini-batch data of ⟨e_(i), e_(j)⟩. σ_(B)²(⟨e_(i), e_(j)⟩_(B)) indicates a variance of the mini-batch data of ⟨e_(i), e_(j)⟩. θ indicates a disturbance.
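A minimal sketch of formula (11) follows, assuming the mini-batch of inner products is held in a NumPy array; the value of θ used here is illustrative only.

```python
import numpy as np

def bn_inner_product(ip_batch, theta=1e-5):
    """Scalarize a mini-batch of inner products <e_i, e_j>_B per formula (11).
    Removing the batch mean and scale leaves alpha_(i,j) as the only carrier
    of the feature interaction item's scale, decoupling it from e_i and e_j."""
    mu_b = ip_batch.mean()               # mu_B(<e_i, e_j>_B)
    var_b = ip_batch.var()               # sigma_B^2(<e_i, e_j>_B)
    return (ip_batch - mu_b) / np.sqrt(var_b + theta)
```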

Still refer to FIG. 4. BN shown in FIG. 4 indicates BN processing on the model parameters in the second model.

Scalarization processing is performed on the model parameters of the feature interaction items, to decouple the model parameters from the architecture parameters of the feature interaction items. In this case, the architecture parameters can more accurately reflect importance of the feature interaction items, further improving optimization accuracy of the architecture parameters. This is explained as follows.

It should be understood that e_(i) is continuously updated and changed in a model training process. After an inner product is performed on e_(i) and e_(j), in other words, ⟨e_(i), e_(j)⟩, a scale of the inner product is also constantly updated. Note that α_((i,j))⟨e_(i), e_(j)⟩ may be obtained through

${\left( \frac{\alpha_{({i,j})}}{\eta} \right)\left( {\eta \cdot \left\langle {e_{i},e_{j}} \right\rangle} \right)},$

where the first term (α_((i,j))/η) is coupled to the second term (η·⟨e_(i), e_(j)⟩). If the second term (η·⟨e_(i), e_(j)⟩) is not scalarized, the first term (α_((i,j))/η) cannot absolutely represent importance of the second term, causing great instability to a system.

In this embodiment of this disclosure, scalarization processing is performed on the model parameters of the feature interaction item, so that α_((i,j))⟨e_(i), e_(j)⟩ can no longer be rescaled through

${\left( \frac{\alpha_{({i,j})}}{\eta} \right)\left( {\eta \cdot \left\langle {e_{i},e_{j}} \right\rangle} \right)};$

in other words, the model parameters of the feature interaction item can be decoupled from the architecture parameters.

The model parameters of the feature interaction item are decoupled from the architecture parameters, so that the architecture parameters can more accurately reflect importance of the feature interaction items, further improving optimization accuracy of the architecture parameters.

In other words, scalarization processing is performed on the model parameters of the feature interaction items, to decouple the model parameters from the architecture parameters of the feature interaction items, so that no coupling effect between the model parameters and the architecture parameters of the feature interaction items causes large instability in the system.

As described above, the second model includes two types of parameters: the architecture parameters and the model parameters. An architecture parameter optimization process involves architecture parameter training and model parameter training. In other words, optimization on the architecture parameters in the second model in step S320 is accompanied by optimization on the model parameters in the second model.

For ease of understanding and description, in the following description, an architecture parameter in the second model is denoted as α, and a model parameter in the second model is denoted as w (corresponding to v_(i) and v_(j) in the formula (4)).

Optionally, in the embodiment shown in FIG. 3, optimization processing on the architecture parameter α in the second model and optimization processing on the model parameter w in the second model include two-level optimization processing on the architecture parameter α and the model parameter w in the second model.

In other words, in step S320, two-level optimization processing is performed on the architecture parameter α and the model parameter w in the second model, to obtain the optimized architecture parameter α*.

In this embodiment, the architecture parameter α in the second model is used as a model hyperparameter for optimization, and the model parameter w in the second model is used as a model parameter for optimization. In other words, the architecture parameter α is used as a high-level decision variable, and the model parameter w is used as a low-level decision variable. Each value of the high-level decision variable α corresponds to a different model.

Optionally, when a model corresponding to a value of the high-level decision variable α is evaluated, an optimal model parameter w* is obtained through complete training of the model. In other words, each time a candidate value of the architecture parameter α is evaluated, complete training of a model corresponding to the candidate value is performed.

Optionally, when a model corresponding to a value of the high-level decision variable α is evaluated, w_(t+1) obtained by updating the model in one step by using mini-batch data is used to replace the optimal model parameter w*.
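As an illustration only, the following sketch pairs the one-step approximation with a gradient update of α. The helper loss_fn and the learning-rate arguments are assumed names, and gradients are not traced through w_(t+1) (a first-order simplification), so this is a sketch under stated assumptions rather than the disclosed implementation.

```python
import torch

def two_level_step(alpha, w, loss_fn, train_batch, val_batch, eta, delta):
    """One round of two-level optimization with the one-step approximation:
    w is updated once on mini-batch data, and the resulting w_{t+1} stands in
    for the fully trained optimal w* when the candidate alpha is evaluated."""
    w_grad, = torch.autograd.grad(loss_fn(w, alpha, train_batch), w)
    w_next = (w - delta * w_grad).detach().requires_grad_()       # w_{t+1} ~ w*
    a_grad, = torch.autograd.grad(loss_fn(w_next, alpha, val_batch), alpha)
    alpha_next = (alpha - eta * a_grad).detach().requires_grad_()
    return alpha_next, w_next
```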

Optionally, in the embodiment shown in FIG. 3, optimization processing on the architecture parameter α in the second model and optimization processing on the model parameter w in the second model include simultaneous optimization on both the architecture parameter α and the model parameter w in the second model by using same training data.

In other words, in step S320, simultaneous optimization processing is performed on both the architecture parameter α and the model parameter w in the second model by using the same training data, to obtain the optimized architecture parameter α*.

In this embodiment, in each round of training in an optimization process, simultaneous optimization is performed on both the architecture parameter α and the model parameter w based on a same batch of training data. Alternatively, the architecture parameter and the model parameter in the second model are considered as decision variables at a same level, and simultaneous optimization is performed on both the architecture parameter α and the model parameter w in the second model, to obtain the optimized architecture parameter α*.

In this embodiment, optimization processing performed on the architecture parameter α and the model parameter w in the second model may be referred to as one-level optimization processing.

For example, the architecture parameter α and the model parameter w in the second model freely explore their feasible regions in stochastic gradient descent (SGD) optimization until convergence.

For example, the architecture parameter α and the model parameter w in the second model are optimized by using the following formula (12):

α_(t)=α_(t-1)−η_(t)·∂_(α)L_(train)(w_(t-1),α_(t-1))

w_(t)=w_(t-1)−δ_(t)·∂_(w)L_(train)(w_(t-1),α_(t-1))  (12).

α_(t) indicates the architecture parameter after optimization in step t is performed. α_(t-1) indicates the architecture parameter after optimization in step t−1 is performed. w_(t) indicates the model parameter after optimization in step t is performed. w_(t-1) indicates the model parameter after optimization in step t−1 is performed. η_(t) indicates a learning rate of the architecture parameter during optimization in step t. δ_(t) indicates a learning rate of the model parameter during optimization in step t. L_(train)(w_(t-1), α_(t-1)) indicates a loss function value of a loss function on the training set during optimization in step t. ∂_(α)L_(train)(w_(t-1), α_(t-1)) indicates a gradient of the loss function on the training set relative to the architecture parameter α during optimization in step t. ∂_(w)L_(train)(w_(t-1), α_(t-1)) indicates a gradient of the loss function on the training set relative to the model parameter w during optimization in step t.
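A minimal sketch of one iteration of formula (12) follows; loss_fn and the learning-rate arguments are assumed names, and the same mini-batch is deliberately used for both updates.

```python
import torch

def one_level_step(alpha, w, loss_fn, batch, eta_t, delta_t):
    """Formula (12): update alpha and w simultaneously on the same
    mini-batch of training data (one-level optimization)."""
    loss = loss_fn(w, alpha, batch)          # L_train(w_{t-1}, alpha_{t-1})
    a_grad, w_grad = torch.autograd.grad(loss, (alpha, w))
    alpha_next = (alpha - eta_t * a_grad).detach().requires_grad_()
    w_next = (w - delta_t * w_grad).detach().requires_grad_()
    return alpha_next, w_next
```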

In this embodiment, one-level optimization processing is performed on the architecture parameters and the model parameters in the second model, to implement optimization on the architecture parameters in the second model, so that the architecture parameters and the model parameters can be simultaneously optimized. Therefore, time consumed in an optimization process of the architecture parameters in the second model can be reduced, to further help improve efficiency of feature interaction item selection.

After step S330 is completed, in other words, after feature interaction item selection is completed, the third model is a model obtained through feature interaction item selection.

In step S340, the third model is trained.

The third model may be trained directly, or the third model may be trained after an L1 regularization term and/or an L2 regularization term are/is added to the third model.
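For illustration, a sketch of a training objective with optional L1/L2 regularization terms is shown below; the log loss base objective and the weight values are assumptions, since the disclosure does not fix them.

```python
import torch

def regularized_loss(pred, target, params, l1=0.0, l2=0.0):
    """Base training loss for the third model plus optional L1/L2 terms;
    params is an iterable of the model's trainable tensors."""
    base = torch.nn.functional.binary_cross_entropy(pred, target)
    reg = sum(l1 * p.abs().sum() + l2 * p.pow(2).sum() for p in params)
    return base + reg
```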

An objective of training the third model may be determined based on an application requirement.

For example, assuming that a CTR prediction model is to be obtained, the third model is trained by using CTR prediction as the training objective, to obtain the CTR prediction model.

For another example, assuming that a conversion rate (CVR) prediction model is to be obtained, the third model is trained by using CVR prediction as the training objective, to obtain the CVR prediction model.

Alternatively, the third model is a model obtained through feature interaction item deletion based on the first model. For details, refer to the foregoing description of step S330. Details are not described herein again.

Alternatively, the third model is a model obtained through feature interaction item deletion based on the second model. For details, refer to the foregoing description of step S330. Details are not described herein again.

It should be understood that, through feature interaction item deletion (or selection), the architecture parameters are retained in the model to train the model, so that importance of the feature interaction items can be further learned.

It can be learned from the foregoing description that, in embodiments of this disclosure, the architecture parameters are introduced into the FM-based model, so that feature interaction item selection can be performed through optimization on the architecture parameters. In other words, in this disclosure, feature interaction item selection can be performed through optimization on the architecture parameters, and training for a plurality of candidate subsets in the conventional technology is not required. Therefore, this can effectively reduce a computing workload of FIS to save computing power, and improve efficiency of FIS.

In addition, in the solution provided in this embodiment of this disclosure, the feature interaction item in the FM-based model can be extended to a higher order.

FIG. 5 is another schematic flowchart of an automatic FIS method 500 according to an embodiment of this disclosure.

First, training data is obtained.

For example, assuming that a quantity of features is m, the training data is obtained for features of m fields.

S510: Enumerate and enter feature interaction items into an FM-based model.

The FM-based model may be the FM model shown in the foregoing formula (1) or formula (2), or may be any one of the following FM-based models: a DeepFM model, an IPNN model, an AFM model, and an NFM model.

Enumerating and entering feature interaction items into an FM-based model means building, based on all interactions of the m features, feature interaction items based on an FM model.

It should be understood that when the feature interaction items are being built, auxiliary vectors of the m features are involved.

For example, the embedding layer shown in FIG. 1 or FIG. 3 may be used to obtain the auxiliary vectors of the m features. A technology of obtaining the auxiliary vectors of the m features through the embedding layer belongs to a conventional technology. Details are not described in this specification.
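As a sketch of step S510 under the assumption that the embedding layer yields one k-dimensional auxiliary vector per field, the second-order feature interaction items can be enumerated as follows (the function and argument names are illustrative):

```python
import itertools
import numpy as np

def enumerate_interactions(embeddings):
    """Build all second-order feature interaction items <v_i, v_j> from the
    auxiliary vectors of m features; embeddings is an (m, k) array."""
    m = embeddings.shape[0]
    return {(i, j): float(embeddings[i] @ embeddings[j])
            for i, j in itertools.combinations(range(m), 2)}
```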

S520: Introduce architecture parameters into the FM-based model. Further, one coefficient parameter is added to each feature interaction item in the FM-based model, and the coefficient parameter is referred to as an architecture parameter.

Step S520 corresponds to step S310 in the foregoing embodiment. For specific descriptions, refer to the foregoing description.

The FM-based model in the embodiment shown in FIG. 5 corresponds to the first model in the embodiment shown in FIG. 3, and a model obtained by adding an architecture parameter to the FM-based model in the embodiment shown in FIG. 5 corresponds to the second model in the embodiment shown in FIG. 3.

S530: Perform optimization on the architecture parameters until convergence, to obtain the optimized architecture parameters.

Step S530 corresponds to step S320 in the foregoing embodiment. For specific descriptions, refer to the foregoing description.

S540: Perform feature interaction item deletion based on the optimized architecture parameters, to obtain a model through feature interaction item deletion.

Step S540 corresponds to step S330 in the foregoing embodiment. For specific descriptions, refer to the foregoing description.

The model obtained through feature interaction item deletion in the embodiment shown in FIG. 5 corresponds to the third model in the embodiment shown in FIG. 3.

S550: Train the model obtained through feature interaction item deletion until convergence, to obtain a CTR prediction model.

Step S550 corresponds to step S340 in the foregoing embodiment. For specific descriptions, refer to the foregoing description.

After the CTR prediction model is obtained through training, online inference may be performed by using the CTR prediction model.

For example, data of a target object is input into the CTR prediction model, and the CTR prediction model outputs a CTR of the target object. Whether to recommend the target object may be determined based on the CTR.

The automatic FIS solution provided in this embodiment of this disclosure may be applied to any FM-based model, for example, an FM model, a DeepFM model, an IPNN model, an AFM model, or an NFM model.

In an example, the automatic FIS solution provided in this embodiment of this disclosure may be applied to an existing FM model.

For example, the architecture parameters are introduced into the existing FM model, so that importance of each feature interaction item is obtained through optimization on the architecture parameters. Then FIS is performed based on the importance of each feature interaction item, to finally obtain an FM model through FIS.

It should be understood that the solution in this disclosure is applied to the FM model, so that feature interaction item selection of the FM model can be efficiently performed, to support extending the feature interaction items of the FM model to a higher order.

In another example, the automatic FIS solution provided in this embodiment of this disclosure may be applied to an existing DeepFM model.

For example, the architecture parameters are introduced into the existing DeepFM model, so that importance of each feature interaction item is obtained through optimization on the architecture parameters. Then FIS is performed based on the importance of each feature interaction item, to finally obtain a DeepFM model through FIS.

It should be understood that the solution in this disclosure is applied to the DeepFM model, so that feature interaction item selection of the DeepFM model can be efficiently performed.

As shown in FIG. 6, this embodiment of this disclosure further provides a data processing method 600. The method 600 includes the following steps: S610 and S620.

S610: Input data of a target object into a CTR prediction model or a CVR prediction model, to obtain a prediction result of the target object.

For example, the target object is a commodity.

S620: Determine a recommendation status of the target object based on the prediction result of the target object.

The CTR prediction model or the CVR prediction model is obtained through the method 300 provided in the foregoing embodiment, that is, the CTR prediction model or the CVR prediction model is obtained through step S310 to step S340 in the foregoing embodiment. Refer to the foregoing description. Details are not described herein again.

In step S340, a third model is trained by using a training sample of the target object, to obtain the CTR prediction model or the CVR prediction model.

Optionally, in step S320, simultaneous optimization is performed on both architecture parameters and model parameters in a second model by using the same training data as that in the training sample of the target object, to obtain the optimized architecture parameters.

Alternatively, the architecture parameters and the model parameters in the second model are considered as decision variables at a same level, and simultaneous optimization is performed on both the architecture parameters and the model parameters in the second model by using the training sample of the target object, to obtain the optimized architecture parameters.

Simulation testing shows that CTR prediction accuracy in online A/B testing is significantly improved, and inference energy consumption is greatly reduced.

As an example, simulation testing shows that when the FIS solution provided in this disclosure is applied to the DeepFM model of a recommender system and online A/B testing is performed, a game download rate can be increased by 20%, a CTR prediction accuracy rate can be relatively improved by 20.3%, and a CVR can be relatively improved by 20.1%. In addition, a model inference speed can be effectively improved.

In an example, an FM model and a DeepFM model are obtained on the public dataset Avazu by using the solution provided in this disclosure. Results of comparing performance of the FM model and the DeepFM model obtained by using the solution in this disclosure with performance of other models in the industry are shown in Table 1 and Table 2. Table 1 indicates comparison of second-order models, and Table 2 indicates comparison of third-order models. In a second-order model, the highest order of a feature interaction item in the model is the second order. In a third-order model, the highest order of a feature interaction item in the model is the third order.

TABLE 1 (public dataset Avazu)

Model                 AUC     Log loss  Top   Time (s)  Search + re-train cost (min)  Rel. Impr.
FM                    0.7793  0.3805    100%  0.51      0 + 3                         0
FwFM                  0.7822  0.3784    100%  0.52      0 + 4                         0.37%
AFM                   0.7806  0.3794    100%  1.92      0 + 14                        0.17%
Field-aware FM (FFM)  0.7831  0.3781    100%  0.24      0 + 6                         0.49%
DeepFM                0.7836  0.3776    100%  0.76      0 + 6                         0.55%
GBDT + LR             0.7721  0.3841    100%  0.45      8 + 3                         −0.92%
GBDT + FFM            0.7835  0.3777    100%  2.66      6 + 21                        0.54%
AutoFM (2nd)          0.7831  0.3778    29%   0.23      4 + 2                         0.49%
AutoDeepFM (2nd)      0.7852  0.3765    24%   0.48      7 + 4                         0.76%

TABLE 2 (public dataset Avazu)

Model             AUC     Log loss  Top      Time (s)  Search + re-train cost (min)  Rel. Impr.
FM (3rd)          0.7843  0.3772    100%     5.70      0 + 21                        0.64%
DeepFM (3rd)      0.7854  0.3765    100%     5.97      0 + 23                        0.78%
AutoFM (3rd)      0.7860  0.3762    25%/2%   0.33      22 + 5                        0.86%
AutoDeepFM (3rd)  0.7870  0.3756    21%/10%  0.94      24 + 10                       0.99%

In Table 1 and Table 2, meanings of the horizontal table headers are as follows.

AUC indicates the area under the curve. Log loss indicates the logarithmic loss value. Top indicates a proportion of feature interaction items retained through feature interaction item selection. Time indicates a time period for a model to infer two million samples. Search + re-train cost indicates a time period consumed for search and retraining, where the time period consumed for search indicates a time period consumed for step S320 and step S330 in the foregoing embodiment, and the time period consumed for retraining indicates a time period consumed for step S340 in the foregoing embodiment. Rel. Impr. indicates a relative improvement value.

In Table 1, meanings of vertical table headers are as follows.

FM, field-weighted FM (FwFM), AFM, FFM, and DeepFM represent FM-based models in the conventional technology. Gradient boosting decision tree (GBDT) + logistic regression (LR) and GBDT + FFM indicate models that use manual FIS in the conventional technology.

AutoFM (2nd) represents a second-order FM model obtained by using the solution provided in this embodiment of this disclosure. AutoDeepFM (2nd) represents a second-order DeepFM model obtained by using the solution provided in this embodiment of this disclosure.

In Table 2, meanings of vertical table headers are as follows.

FM (3rd) represents a third-order FM model in the conventional technology. DeepFM (3rd) represents a third-order DeepFM model in the conventional technology.

AutoFM (3rd) represents a third-order FM model obtained by using the solution provided in this embodiment of this disclosure. AutoDeepFM (3rd) represents a third-order DeepFM model obtained by using the solution provided in this embodiment of this disclosure.

It can be learned from Table 1 and Table 2 that, compared with the conventional technology, CTR prediction performed by using the FM model or the DeepFM model obtained in the solution provided in this embodiment of this disclosure can significantly improve CTR prediction accuracy, and can effectively reduce an inference time period and energy consumption.

It can be learned from the foregoing description that, in embodiments of this disclosure, the architecture parameters are introduced into the FM-based model, so that feature interaction item selection can be performed through optimization on the architecture parameters. In other words, in this disclosure, provided that optimization on the architecture parameters is performed once, feature interaction item selection can be performed, and training for a plurality of candidate subsets in the conventional technology is not required. Therefore, this can effectively reduce a computing workload of FIS to save computing power, and improve efficiency of FIS.

In addition, in the solution provided in this embodiment of this disclosure, the feature interaction item in the FM-based model can be extended to a higher order.

Embodiments described in this specification may be independent solutions, or may be combined based on internal logic. All these solutions fall within the protection scope of this disclosure.

The foregoing describes the method embodiments provided in this disclosure, and the following describes apparatus embodiments provided in this disclosure. It should be understood that descriptions of the apparatus embodiments correspond to the descriptions of the method embodiments. Therefore, for content that is not described in detail, refer to the foregoing method embodiments. For brevity, details are not described herein again.

As shown in FIG. 7, this embodiment of this disclosure further provides a data processing apparatus 700. The apparatus 700 includes the following units.

A first processing unit 710 is configured to add an architecture parameter to each feature interaction item in a first model, to obtain a second model, where the first model is an FM-based model, and the architecture parameter represents importance of a corresponding feature interaction item.

A second processing unit 720 is configured to perform optimization on architecture parameters in the second model, to obtain the optimized architecture parameters.

A third processing unit 730 is configured to obtain, based on the optimized architecture parameters and the first model or the second model, a third model through feature interaction item deletion.

Optionally, the second processing unit 720 performs optimization on the architecture parameters, to allow the optimized architecture parameters to be sparse.

In this embodiment, the third processing unit 730 is configured to obtain, based on the first model or the second model, the third model by deleting each feature interaction item whose optimized architecture parameter has a value less than a threshold.

Optionally, the second processing unit 720 performs optimization on the architecture parameters, to allow a value of an architecture parameter of at least one feature interaction item to be equal to zero after optimization is completed.

For example, the second processing unit 720 is configured to optimize the architecture parameters in the second model using a gRDA optimizer, where the gRDA optimizer allows the value of the architecture parameter of the at least one feature interaction item to tend to zero during an optimization process.

Optionally, the second processing unit 720 is further configured to perform optimization on model parameters in the second model, where optimization includes scalarization processing on the model parameters in the second model.

For example, the second processing unit 720 is configured to perform BN processing on the model parameters in the second model.

Optionally, the second processing unit 720 is configured to perform simultaneous optimization on both the architecture parameters and model parameters in the second model by using same training data, to obtain the optimized architecture parameters.

Optionally, the apparatus 700 further includes a training unit 740 configured to train the third model.

Optionally, the training unit 740 is configured to train the third model, to obtain a CTR prediction model or a CVR prediction model.

The apparatus 700 may be integrated into a terminal device, a network device, or a chip.

The apparatus 700 may be deployed on a compute node of a related device.

As shown in FIG. 8, this embodiment of this disclosure further provides a data processing apparatus 800. The apparatus 800 includes the following units.

A first processing unit 810 is configured to input data of a target object into a CTR prediction model or a CVR prediction model, to obtain a prediction result of the target object.

A second processing unit 820 is configured to determine a recommendation status of the target object based on the prediction result of the target object.

The CTR prediction model or the CVR prediction model is obtained through the method 300 or 500 in the foregoing embodiments.

Training of a third model includes the following step: train the third model by using a training sample of the target object, to obtain the CTR prediction model or the CVR prediction model.

Optionally, optimization on architecture parameters includes the following step: perform simultaneous optimization on both the architecture parameters and model parameters in a second model by using the same training data as that in the training sample of the target object, to obtain the optimized architecture parameters.

The apparatus 800 may be integrated into a terminal device, a network device, or a chip.

The apparatus 800 may be deployed on a compute node of a related device.

As shown in FIG. 9, this embodiment of this disclosure further provides a data processing apparatus 900. The apparatus 900 includes a processor 910. The processor 910 is coupled to a memory 920. The memory 920 is configured to store a computer program or instructions. The processor 910 is configured to execute the computer program or the instructions stored in the memory 920, so that the method in the foregoing method embodiments is performed.

Optionally, as shown in FIG. 9, the apparatus 900 may further include the memory 920.

Optionally, as shown in FIG. 9, the apparatus 900 may further include a data interface 930, where the data interface 930 is configured to transmit data to the outside.

Optionally, in a solution, the apparatus 900 is configured to implement the method 300 in the foregoing embodiment.

Optionally, in another solution, the apparatus 900 is configured to implement the method 500 in the foregoing embodiment.

Optionally, in still another solution, the apparatus 900 is configured to implement the method 600 in the foregoing embodiment.

An embodiment of this disclosure further provides a computer-readable medium. The computer-readable medium stores program code to be executed by a device, and the program code is used to perform the method in the foregoing embodiments.

An embodiment of this disclosure further provides a computer program product including instructions. When the computer program product is run on a computer, the computer is enabled to perform the method in the foregoing embodiments.

An embodiment of this disclosure further provides a chip, and the chip includes a processor and a data interface. The processor reads, through the data interface, instructions stored in a memory, to perform the method in the foregoing embodiments.

Optionally, in an implementation, the chip may further include a memory, and the memory stores instructions. The processor is configured to execute the instructions stored in the memory, and when the instructions are executed, the processor is configured to perform the method in the foregoing embodiments.

An embodiment of this disclosure further provides an electronic device. The electronic device includes any one or more of the apparatus 700, the apparatus 800, or the apparatus 900 in the foregoing embodiments.

FIG. 10 is a schematic diagram of a hardware architecture of a chip according to an embodiment of this disclosure. The chip includes a neural-network processing unit (NPU) 1000. The chip may be disposed in any one or more of the following apparatuses or systems: the apparatus 700 shown in FIG. 7, the apparatus 800 shown in FIG. 8, and the apparatus 900 shown in FIG. 9.

The method 300, 500, or 600 in the foregoing method embodiments may be implemented in the chip shown in FIG. 10.

The neural-network processing unit 1000 serves as a coprocessor, and is disposed on a host CPU. The host CPU assigns a task. A core part of the neural-network processing unit 1000 is an operational circuit 1003, and a controller 1004 controls the operational circuit 1003 to obtain data in a memory (a weight memory 1002 or an input memory 1001) and perform an operation.

In some implementations, the operational circuit 1003 includes a plurality of processing engines (PEs). In some implementations, the operational circuit 1003 is a two-dimensional systolic array. Alternatively, the operational circuit 1003 may be a one-dimensional systolic array or another electronic circuit that can perform mathematical operations such as multiplication and addition. In some implementations, the operational circuit 1003 is a general-purpose matrix processor.

For example, it is assumed that there are an input matrix A, a weight matrix B, and an output matrix C. The operational circuit 1003 extracts corresponding data of the matrix B from the weight memory 1002, and buffers the corresponding data into each PE in the operational circuit 1003. The operational circuit 1003 fetches data of the matrix A from the input memory 1001, performs a matrix operation with the matrix B, and stores an obtained partial result or an obtained final result of the matrix into an accumulator 1008.

A vector calculation unit 1007 may perform further processing such as vector multiplication, vector addition, an exponent operation, a logarithmic operation, or value comparison on output of the operational circuit 1003. For example, the vector calculation unit 1007 may be configured to perform network calculation, such as pooling, batch normalization, or local response normalization, at a non-convolutional/non-fully connected (FC) layer in a neural network.

In some implementations, the vector calculation unit 1007 can store a processed output vector in a unified memory (or a unified buffer) 1006. For example, the vector calculation unit 1007 may apply a non-linear function to the output of the operational circuit 1003, for example, a vector of an accumulated value, to generate an activation value. In some implementations, the vector calculation unit 1007 generates a normalized value, a combined value, or both a normalized value and a combined value. In some implementations, the processed output vector can be used as an activation input for the operational circuit 1003, for example, for use in a subsequent layer in the neural network.

The method 300, 500, or 600 in the foregoing method embodiments may be performed by the operational circuit 1003 or the vector calculation unit 1007.

The unified memory 1006 is configured to store input data and outputdata.

For weight data, a direct memory access controller (DMAC) 1005 directly transfers input data in an external memory to the input memory 1001 and/or the unified memory 1006, stores weight data in the external memory into the weight memory 1002, and stores data in the unified memory 1006 into the external memory.

A bus interface unit (BIU) 1010 is configured to implement interaction between the host CPU, the DMAC, and an instruction fetch buffer 1009 by using a bus.

The instruction fetch buffer 1009 connected to the controller 1004 is configured to store instructions used by the controller 1004.

The controller 1004 is configured to invoke the instructions cached in the instruction fetch buffer 1009, to control a working process of the operation accelerator.

In this embodiment of this disclosure, the data herein may be the to-be-processed data, for example, the data of a target object.

Generally, the unified memory 1006, the input memory 1001, the weight memory 1002, and the instruction fetch buffer 1009 each are an on-chip memory. The external memory is a memory outside the NPU. The external memory may be a double data rate (DDR) synchronous dynamic random-access memory (SDRAM), a high bandwidth memory (HBM), or another readable and writable memory.

Unless otherwise defined, all technical and scientific terms used in this specification have the same meanings as those usually understood by a person skilled in the art of this disclosure. The terms used in the specification of this disclosure are merely for the purpose of describing specific embodiments, and are not intended to limit this disclosure.

It should be noted that "first", "second", "third", or "fourth", and various numbers in this specification are merely used for differentiation for ease of description, and are not construed as a limitation to the scope of this disclosure.

A person skilled in the art may be aware that units and algorithm steps in the examples described with reference to the embodiments disclosed in this specification can be implemented by electronic hardware or an interaction of computer software and electronic hardware. Whether the functions are performed by hardware or software depends on particular applications and design constraints of the technical solutions. A person skilled in the art may use different methods to implement the described functions for each particular application, but it should not be considered that the implementation goes beyond the scope of this disclosure.

It may be clearly understood by a person skilled in the art that, for the purpose of convenient and brief description, for a detailed working process of the foregoing system, apparatus, and unit, refer to a corresponding process in the foregoing method embodiments, and details are not described herein again.

In the several embodiments provided in this disclosure, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in another manner. For example, the described apparatus embodiment is merely an example. For example, division into the units is merely logical function division and may be other division in actual implementation. For example, a plurality of units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the displayed or discussed mutual couplings or direct couplings or communications connections may be implemented through some interfaces. The indirect couplings or communications connections between the apparatuses or units may be implemented in an electrical form, a mechanical form, or other forms.

The units described as separate parts may or may not be physically separate. Parts displayed as units may or may not be physical units, and may be located in one position or may be distributed on a plurality of network units. Some or all of the units may be selected based on actual requirements to achieve the objectives of the solutions in embodiments.

In addition, functional units in embodiments of this disclosure may be integrated into one processing unit, each of the units may exist alone physically, or two or more units may be integrated into one unit.

When the functions are implemented in a form of a software functional unit and sold or used as an independent product, the functions may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of this disclosure essentially, or the part contributing to the conventional technology, or some of the technical solutions may be implemented in a form of a software product. The computer software product is stored in a storage medium, and includes several instructions for instructing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of this disclosure. The foregoing storage medium includes any medium that can store program code, such as a Universal Serial Bus (USB) flash disk (UFD) (also briefly referred to as a USB flash drive or a flash memory), a removable hard disk, a read-only memory (ROM), a random-access memory (RAM), a magnetic disk, or a compact disc.

The foregoing descriptions are merely specific implementations of this disclosure, but are not intended to limit the protection scope of this disclosure. Any variation or replacement readily figured out by a person skilled in the art within the technical scope disclosed in this disclosure shall fall within the protection scope of this disclosure. Therefore, the protection scope of this disclosure shall be subject to the protection scope of the claims.

1. A management method implemented by a transaction processing system, wherein the management method comprises: obtaining a first quantity of transactions accessing each combination of data partitions in a cycle; and adjusting, based on the first quantity, storage of the data partitions on a plurality of participant nodes.

2. The management method of claim 1, wherein obtaining the first quantity comprises collecting, in the cycle, the first quantity.

3. The management method of claim 1, wherein obtaining the first quantity comprises: obtaining a second quantity of transactions accessing each combination of data partitions in a history cycle; and predicting, based on the second quantity, the first quantity.

4. The management method of claim 1, wherein adjusting the storage comprises: determining, based on the first quantity, a target combination of data partitions; and migrating one or more data partitions in the target combination from N participant nodes to M participant nodes, and wherein N is greater than M.

5. The management method of claim 1, wherein adjusting the storage comprises adjusting, based on the first quantity and a load of the participant nodes, the storage.

6. A device, comprising: a memory configured to store instructions; and a processor coupled to the memory and configured to execute the instructions to: obtain a first quantity of transactions accessing each combination of data partitions in a cycle; and adjust, based on the first quantity, storage of the data partitions on a plurality of participant nodes.

7. The device of claim 6, wherein the processor is further configured to obtain the first quantity by collecting, in the cycle, the first quantity.

8. The device of claim 6, wherein the processor is further configured to: obtain a second quantity of transactions accessing each combination of data partitions in a history cycle; and predict, based on the second quantity, the first quantity.

9. The device of claim 6, wherein the processor is further configured to execute the instructions to: determine, based on the first quantity, a target combination of data partitions; and migrate one or more data partitions in the target combination of data partitions from N participant nodes to M participant nodes, and wherein N is greater than M.

10. The device of claim 6, wherein the processor is further configured to execute the instructions to adjust, based on the first quantity and a load of the participant nodes, the storage.

11. A computer program product comprising instructions stored on a non-transitory computer-readable medium that, when executed by a processor, cause a device to: obtain a first quantity of transactions accessing each combination of data partitions in a cycle; and adjust, based on the first quantity, storage of the data partitions on a plurality of participant nodes.

12. The computer program product of claim 11, wherein the instructions further cause the device to obtain the first quantity by collecting, in the cycle, the first quantity.

13. The computer program product of claim 11, wherein the instructions further cause the device to: obtain a second quantity of transactions accessing each combination of data partitions in a history cycle; and predict, based on the second quantity, the first quantity.

14. The computer program product of claim 11, wherein the instructions further cause the device to: determine, based on the first quantity, a target combination of data partitions; and migrate one or more data partitions in the target combination from N participant nodes to M participant nodes, and wherein N is greater than M.

15. The computer program product of claim 11, wherein the instructions further cause the device to adjust, based on the first quantity and a load of the participant nodes, the storage.

16. The computer program product of claim 11, wherein the instructions further cause the device to reduce a quantity of the participant nodes participating in one of the transactions.

17. The computer program product of claim 11, wherein the instructions further cause the device to adjust, based on the first quantity, a load of the participant nodes, and a load balance policy, the storage.

18. The computer program product of claim 11, wherein the instructions further cause the device to adjust, based on the first quantity and a load balance policy, the storage.

19. The computer program product of claim 11, wherein the instructions further cause the device to migrate the data partitions to a same node.

20. The computer program product of claim 11, wherein the instructions further cause the device to migrate the data partitions according to a pattern of repetition in a fixed time period.